Improvements to a Roll-Back Mechanism for Asynchronous Checkpointing and Recovery

Author

  • Monika Kapus-Kolar
Abstract

Gupta, Rahimi and Yang recently proposed a novel recovery algorithm for distributed systems in which checkpoints are taken asynchronously [1]. A checkpoint taken by a process is a snapshot of its local state, stored in stable storage so that the process can roll back to it if this becomes necessary. The start of a process is also one of its checkpoints. Asynchronous checkpointing means that processes take their checkpoints independently. A failure in a distributed system in principle requires that all its constituent processes roll back to a global state from which the system can resume its operation as if it had started from it, i.e., to a globally consistent set of local checkpoints (GCSLC), ideally to the so-called maximum GCSLC, in which every local checkpoint is as recent as possible. In the case of asynchronous checkpointing, when a failure occurs, the processes have yet to find the maximum GCSLC. They do that by running a checkpoint coordination algorithm (CCA). Alternatively, the processes might agree to restart from the GCSLC which they currently treat as the starting state of the system, i.e., from the current recovery line. This is the most recently computed maximum GCSLC, initially the actual starting state of the system. The authors of [1] suggest that the processes occasionally initiate a CCA just for advancing the recovery line. In this paper, we demonstrate that, because of a subtle logical flaw, the CCA of [1] sometimes returns a checkpoint set which is not globally consistent. We correct the flaw and also suggest some other improvements. The rest of the paper is organized as follows. In the next section, we describe the system and the testing of checkpoint consistency and give a brief outline of the CCA of [1]. In Section 3, we explain and correct the flaw. In Section 4, we suggest several other improvements of the CCA. A detailed specification of the improved CCA is given in Section 5. Section 6 suggests that checkpointing and recovery-line advancing in the absence of failures should be more flexible. As the proposed CCA correction increases the usage of the local stable storages, we suggest in Section 7 how to organize them. Section 8 comprises a discussion and conclusions.
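
To make the notions of a globally consistent set of local checkpoints and of the maximum GCSLC more concrete, the following is a minimal sketch of generic rollback propagation. It is not the CCA of [1], nor the improved CCA specified in the paper. It assumes reliable FIFO channels and that every checkpoint stores per-peer counters of messages sent and received so far; the names `Checkpoint`, `is_consistent` and `max_consistent_line` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    # Assumption: each checkpoint records, per peer, how many messages
    # the process had sent to / received from that peer when it was taken.
    sent: dict = field(default_factory=dict)
    recv: dict = field(default_factory=dict)


def is_consistent(line, history):
    """Return True if no checkpoint in `line` records an orphan message.

    `history[p]` is the list of checkpoints of process p, oldest first
    (index 0 is the start of the process); `line[p]` is an index into it.
    """
    for p, ip in line.items():
        for q, iq in line.items():
            if p == q:
                continue
            # q has received more from p than p has sent: orphan message.
            if history[q][iq].recv.get(p, 0) > history[p][ip].sent.get(q, 0):
                return False
    return True


def max_consistent_line(history):
    """Start from the latest checkpoints and roll receivers back until
    no orphan message remains (centralized sketch, not a distributed CCA)."""
    line = {p: len(cps) - 1 for p, cps in history.items()}
    changed = True
    while changed:
        changed = False
        for p in history:
            for q in history:
                if p == q:
                    continue
                # The start checkpoint has received nothing, so this stops
                # at index 0 at the latest.
                while (history[q][line[q]].recv.get(p, 0)
                       > history[p][line[p]].sent.get(q, 0)):
                    line[q] -= 1
                    changed = True
    return line


# Hypothetical two-process history: Q's latest checkpoint has seen two
# messages from P, but P's latest checkpoint records only one as sent.
history = {
    "P": [Checkpoint(), Checkpoint(sent={"Q": 1})],
    "Q": [Checkpoint(), Checkpoint(recv={"P": 2})],
}
```

In this example the set of latest checkpoints is not globally consistent, so `max_consistent_line(history)` keeps P at its latest checkpoint and rolls Q back to its start. A real CCA has to reach the same kind of result by exchanging messages among the processes rather than by inspecting a shared `history`.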


Similar resources

A Novel Roll-Back Mechanism for Performance Enhancement of Asynchronous Checkpointing and Recovery

In this paper, we present a high performance recovery algorithm for distributed systems in which checkpoints are taken asynchronously. It offers fast determination of the recent consistent global checkpoint (maximum consistent state) of a distributed system after the system recovers from a failure. The main feature of the proposed recovery algorithm is that it avoids to a good extent unnecessar...


A Low Overhead Recovery Technique Using Quasi-Synchronous Checkpointing

In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line which helps bound rollback propagation during a recovery. Thus, it has th...


Comprehensive Low-overhead Process Recovery Based on Quasi-synchronous Checkpointing

In this paper, we propose a low-overhead recovery algorithm based on a quasi-synchronous checkpointing algorithm. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line which helps bound rollback propagation during a recovery. Thus, it has the easen...


Checkpoint and Rollback in Asynchronous Distributed Systems

This paper proposes a novel algorithm for taking checkpoints and rolling back the processes for recovery in asynchronous distributed systems. The algorithm has the following properties: (1) Multiple processes can simultaneously initiate the checkpointing. (2) No additional message is transmitted for taking checkpoints. (3) A set of local checkpoints taken by multiple processes denotes a consist...



Journal title:
  • Informatica (Slovenia)

Volume 33  Issue 

Pages  -

Publication date 2009